Explainable sequence-to-sequence GRU neural network for pollution forecasting

The goal of pollution forecasting models is to allow the prediction and control of the air quality. Non-linear data-driven approaches based on deep neural networks have been increasingly used in such contexts showing significant improvements w.r.t. more conventional approaches like regression models and mechanistic approaches. While such deep learning models were deemed for a long time as black boxes, recent advances in eXplainable AI (XAI) allow to look through the model’s decision-making process, providing insights into decisive input features responsible for the model’s prediction. One XAI technique to explain the predictions of neural networks which was proven useful in various domains is Layer-wise Relevance Propagation (LRP). In this work, we extend the LRP technique to a sequence-to-sequence neural network model with GRU layers. The explanation heatmaps provided by LRP allow us to identify important meteorological and temporal features responsible for the accumulation of four major pollutants in the air (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {PM}_{10}$$\end{document}PM10, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {NO}_{2}$$\end{document}NO2, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {NO}$$\end{document}NO, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\text {O}_{3}$$\end{document}O3), and our findings can be backed up with prior knowledge in environmental and pollution research. This illustrates the appropriateness of XAI for understanding pollution forecastings and opens up new avenues for controlling and mitigating the pollutants’ load in the air.

Our main focus in this work is to explain the predictions of the pollution forecasting sequence-to-sequence model with GRU layers as was proposed in Petry et al. 3 , using the XAI method of LRP 7 , in order to demonstrate the usefulness of XAI for this task and domain. To this end, we extend the popular LRP technique to GRU layers and to a sequence-to-sequence recurrent architecture. This extension allows us to gain insights into which sets of input variables are the most relevant for predicting high concentration levels of various pollutants.
For instance, in the case of the pollutant NO (nitrogen monoxide), we find that the morning traffic hours play an important role in predicting high levels of the pollution, which is in agreement with results from previous work that show that NO displays a strong dependency on diurnal source emissions 8 , while for predicting the load of the other pollutants we considered in our work the hour of the day is less important, and other meteorological factors as well as historical measurements are instead determinant. Besides, for the O 3 (ozone) pollutant concentration, we find that the high predictions rely on warm temperature and low humidity, which is also consistent with wellestablished domain knowledge about the central role of solar radiation for ozone formation 9,10 , as well as well as prior work suggesting that dry deposition via trees might also be involved in regulating the O 3 concentration 11 .
Along the way, to ensure the correctness of our implementation, we verified the LRP relevance conservation property 12 , as well as validated the LRP input features' relevances using a toy task (similarly to Arras et al. 13 ). For reproducibility of our results we make our code publicly available (https:// github. com/ Sara-mibo/ LRP_ Encod erDec oder_ GRU). More broadly, our explainable GRU-based and sequence-to-sequence recurrent model can be employed on any other task or domain that relies on the same neural network components. Finally, we show that it is possible to simplify our pollution forecasting model's structure by inspecting the LRP results. This adjustment speeds up the training and inference by reducing the model's complexity with a negligible change in the prediction performance.

Related work
Pollution forecasting. Conventional approaches for pollution forecasting are based on numerical simulations taking into account the physical and chemical processes involved in the emission, diffusion and transport of air pollutants. These models are built upon specialized sub-modules that deal with the different pollution causes (such as natural and anthropogenic sources, chemical transformations, aerosol processes and microphysics, pollutant transport through wind and diffusion, dry and wet deposition), and they require sophisticated parameterizations 14 .
Alternative data-driven approaches for pollution forecasting, and more generally for Earth system science, have also been explored recently based on regression analysis, autoregressive statistical models and machine learning [15][16][17] . Among these models, non-linear models such as artificial neural networks improved the prediction accuracy over linear models, and compared to mechanistic approaches, they present the advantage of not requiring a precise knowledge of the underlying chemical and physical processes. However they typically necessitate a larger amount of historical data to be available [18][19][20][21][22][23] . Additionally, hybrid systems combining different methods, e.g. autoregressive statistical models with artificial neural networks 24 , have also been proposed 15 . Besides, in order to explicitly incorporate the spatial dependency between locations and the diffusion of pollutants into the model, some recent works used an averaging of data across neighboring locations as input 25 , or feed the data into the neural network model via a grid structure using a convolutional or a graph neural network 26,27 (which bears some similarity with the tasks of video and motion prediction 16 ). In our work we focus on a model that primarily consideres temporal data as input (and we modelize the spatial location as a static input feature), as this setup was found to reach sufficient performance in a real-world application 3 , and our main objective in the present work is to demonstrate the pertinence of XAI in the pollution forecasting domain and using a standard and widely-used neural network architecture for temporal data.
Among artificial neural networks, recurrent networks are particularly well-suited to modelize temporal data. Compared to feed-forward neural networks where one has to select the time window and averaging of the input variables, recurrent networks can directly receive the raw time series as input and learn the useful temporal dependencies from the training data 28 . LSTMs 29,30 and GRUs [31][32][33] are two popular such recurrent network architectures that further alleviate the vanishing gradient problem of vanilla recurrent nets through the introduction of memory cells and gating mechanisms, that allow them to learn longer-range temporal dependencies. For example in the field of hydrology, the standard LSTM model as well as novel LSTM variants, were used to predict the basin water discharge from rainfall measurements and other temporal meteorological variables (as well as static catchment attributes), and were shown to outperform high-quality basin-calibrated hydrological models, in particular for flood prediction [34][35][36] . On the task of pollution forecasting that we consider in the present work, the LSTM and GRU networks were already compared to one another and achieved similar prediction performance 3 , though the GRU model has less parameters, therefore we use a GRU-based model in our work.
Despite their high prediction performance, thus far artificial neural networks were not largely adopted in pollution forecasting due to their lack of interpretability 28,37,38 . Indeed, as opposed to regression-based models where one can readily obtain the contributions of environmental variables to the prediction task, neural networks are typically considered as black-boxes. Notwithstanding, in recent years several post-hoc XAI methods have been been proposed to inspect the decision process of neural networks: given a trained neural network, these methods are able to assign to each input variable a relevance score representing the contribution of that variable to a given prediction 5,39 . These methods not only facilitate transparency and trust in the model's decision, they potentially also lead to new scientific insights and discoveries 40  www.nature.com/scientificreports/ a feed-forward neural network, and successively removed the least significant features from the input to arrive at a simpler model with similar performance. In García et al. 42 the authors used the SHAP method to interpret the decisions of an LSTM that also predicts NO 2 concentration, and checked the compatibility of the results with previous intuitions and scientific domain knowledge. Another recent work by Yang et al. 43 analyzed the contribution of inputs upon recurrent networks trained on different sets of variables for the prediction of PM 2.5 using the SHAP method. In our work we will employ the XAI technique of LRP and verify the consistency of the results with available domain knowledge, as well as validate our results intrinsically by using the LRP relevances to simplify our model's architecture. More generally, among the existing post-hoc XAI methods for neural networks, one can essentially identify three main approaches. One is based on the gradient of the prediction function w.r.t. the input variables and includes Saliency [44][45][46] , Gradient times Input 47 and Integrated Gradients 48 . While gradients are easy to compute, they are known to be noisy and subject to high variance, this is why smoothing techniques based on random sampling such as SmoothGrad have been proposed 49 . However the introduced post-processing step induces additional hyperparameters and randomness in the explanation process 50 . Another line of approach which is also model-agnostic is based on solving an ad-hoc optimization problem to learn a local linear model using perturbed or masked versions of the original input, which includes the LIME 6 and the Kernel SHAP 51 methods. These sampling-based techniques are typically non-deterministic, expensive to compute, and it is unclear how to extend them to recurrent networks that present temporal dependencies in the input sequence, since their original formulations do not consider the order of the inputs 52 . Finally, another way of explaining decisions is based on the layer-wise decomposition of the prediction function and involves a custom backward propagation through the network that verifies an overall conservation principle. LRP 7,12 and Deep Taylor Decomposition 53 are prime examples of this approach.
The LRP explanation method presents the advantage of being fast to compute, since it requires only a single backward pass through the network, however it is not model-agnostic and requires a specific implementation for each novel type of neural network layer. Theoretically, the LRP propagation procedure can be justified via the mathematical framework of Deep Taylor Decomposition 53,54 . In terms of practical application, LRP already demonstrated its superiority over other XAI methods on convolution neural networks in the computer vision domain 4,50,55 , as well as on a targeted simulation in Geoscience inspired by remote sensing tasks where spatial patterns need to be extracted 56 . Its usefulness has also been demonstrated in the Earth science field, where it was able to identify physically meaningful patterns of climate variability using feed-forward networks trained on geospatial fields of sea surface temperature 57 . With the present work, we further showcase its suitability in the environmental domain for the task of pollution forecasting, using a sequence-to-sequence GRU-based neural network model.
The LRP technique was previously extended to recurrent networks such as LSTMs 58 , and the resulting explanations were evaluated against other existing XAI methods on toy tasks and in natural language processing 13,54,59 , where LRP was shown to deliver superior results. Further, this LRP extension to recurrent networks was successfully applied in the health domain for therapy prediction 60 and in computer security for discovering vulnerability in source code 61 . A recent work also applied the technique to a recurrent network for PM 2.5 forecasting, where it helped identifying a subset of input variables that lead to a similar prediction performance when the model was re-trained only on this subset 62 .
Therefore, we will employ the LRP explanation technique to explain our pollution forecasting model 3 . To this end we adapt the method from Arras et al. 58 for a GRU layer and to a sequence-to-sequence network architecture.

Pollution forecasting task
Dataset. We consider data collected from a total of 22 measuring stations located at the city of Stuttgart and the federal state of North Rhine-Westphalia (NRW) in Germany, during the years 2010 to 2020. These stations are located at different places including urban, suburban and rural areas with diverse emission sources. Table 1 represents the input features used for training our model, along with their source and unit. These input features include the concentration of four different pollutants, namely PM 10 , NO, NO 2 and O 3 (which are also the pollutants which are being forecasted), six meteorological features (temperature, precipitation, sunshine, humidity, wind direction and speed), five temporal features (year, month, day, hour, working vs. work-free day), as well as one static feature representing the station number. The temporal resolution (i.e. the time interval between two measurements/features) used for the inputs and the prediction is 1 h. The resulting collected dataset has more than 1 million data sequences of length 216 h (corresponding to 9 days) , which are distributed randomly into train, validation, and test sets with the proportions of 10/12, 1/12, 1/12.
During the forecast, the measured meteorological input features are replaced by weather forecasts, and the measured pollutant concentrations by the corresponding predicted ones in previous time steps.
In the original work by Petry et al. 3 , some features were used with a cyclic-feature encoding (sine and cosine). In order to facilitate the interpretation of the LRP relevance results, we changed the encoding of these features to one-hot encoding and trained a model with this new input representation. As an additional preprocessing step we normalized numerical features to mean zero and unit variance. Further details on the data collection can be found in the prior work of Petry et al. 3 .
Sequence-to-sequence model. The sequence-to-sequence pollution forecasting model we use in the present work is similar to the one introduced in Petry et al. 3 . It contains two recurrent neural network (RNN) modules with each two Gated Recurrent Units (GRU) layers of dimension 1024. The decoder module additionally contains a fully-connected linear layer of size 1024 followed by a ReLU activation, plus a linear output layer without bias. Discarding the bias from the output layer is important for the XAI decomposition, such that when www.nature.com/scientificreports/ explaining the model's predictions we are able to decompose the full output value into input variables' contributions (indeed most XAI methods, including LRP, ignore the output layer's bias in the relevance decomposition process). The output layer has size four, and we train one model to predict all four pollutants ( PM 10 , NO, NO 2 and O 3 ) at once. The resulting model is depicted in Fig. 1. The model is trained by minimizing the mean squared error, using the AdamW optimizer and a batch size of 128. The trained model has a mean absolute prediction error of 3.73 μg/m 3 on the validation set. Data is fed into the model sequentially 1 h at a time. The encoder receives as input the data of the last 7 days (168 h), which is called historical input (we refer to this data via the subscript h in the features). In each time step of the prediction, the decoder receives as input the weather forecast of the current hour, together with the pollution forecast of the last hour, plus the temporal input features and the static input, and makes a new prediction. www.nature.com/scientificreports/ The input data of the decoder module is called forecast input (we refer to this data using the subscript f in the features, or using the keyword intermediate in the case of the pollutant forecasts in prior time steps). The model provides a forecast of the pollutants particulate matter ( PM 10 ), nitrogen monoxide (NO), nitrogen dioxide ( NO 2 ) and ozone ( O 3 ) for the next 2 days (48 h). The model's predictions are denoted by ŷ in Fig. 1 (as opposed to y which denotes the true value of the pollutants). More information on the implementation details can be found in Petry et al. 3 .

Explaining predictions
Layer-wise relevance propagation (LRP). LRP is a method to explain the decisions of a neural network model by computing the contribution of each input feature towards the predicted output 7 . More precisely, LRP redistributes layer-by-layer relevance scores from the output layer until the input layer neurons using a layerwise relevance conservation property as shown in Eq. (1). The sum of the scores that neurons of the lower-layer receive from the upper-layer is conserved across all layers 7,12 (up to the relevance which will be absorbed by the hidden layer "bias neurons" which we depict through the index 0 in Eq. 1).
where f (x) is the model's prediction for one target output neuron of interest (in our case the prediction for one pollutant type), x is the model's input, R (l) d is the relevance of neuron d in layer l, and N l is the number of neurons (i.e., the dimension) of layer l. Further, we denote by R L 1 the relevance of the output neuron of interest, L being the final layer of the model, which is set equal to f (x) for initializing the LRP backward propagation process.
Hence, every neuron in the network will get assigned its own relevance score, up to the input layer neurons which represent the input features' contributions. The backward relevance propagation process of LRP is implemented in practice by specific LRP propagation rules which depend on the type of the network layer involved in the process. These rules have been theoretically justified via the mathematical framework of Deep Taylor Decomposition 53,54 . In the next subsection we will provide more details on these rules.

LRP for a Seq-2-Seq model with GRUs.
In this work we extend LRP to a sequence-to-sequence model with GRU layers. Given an input sequence indexed by t, the GRU cell 31-33 performs the following recurrence computations, as provided in Eq. (2): where h t and x t are the hidden state and the input vectors at time step t, and h (t−1) is the hidden state from the previous time step. r t and z t are the reset and update gates, and n t is the new candidate hidden state. ( W ir W iz W in ) and ( b ir b iz b in ) are input-to-hidden weights and biases. ( W hr W hz W hn ) and ( b hr b hz b hn ) are hidden-to-hidden weights and biases. σ is the sigmoid activation function and is used for gate neurons, while tanh is the tanh activation, and both non-linear activations are applied element-wise. ⊙ denotes an element-wise multiplication. The remaining operations are standard vector addition and matrix-vector product. All lowercase variables are vectors, and uppercase variables are matrices.
Hence a GRU cell consists of the three following types of layers, and for each of them we describe below the LRP rule that we employ to redistribute the relevance scores from the upper-layer neurons onto the lower-layer neurons: • Linear layer. For a linear layer with z i representing the lower-layer neurons, weights w ij and biases b j , in the forward pass the upper-layer neurons z j are computed using z j = i z i · w ij + b j . To compute relevance scores of the lower-layer neurons R i , given the relevances of the upper-layer neurons R j , we follow the LRP-epsilon rule (LRP-ǫ) 7,58 : where R i←j is the share of relevance that neuron i receives from neuron j and ε is a small positive number that is used as a numerical stabilizer. The relevance R i is finally computed as R i = j R i←j , i.e. the sum of all relevance shares that a neuron receives from the upper-layer neurons connected to it. • Multiplicative layer. Another type of layer occuring in a GRU cell is multiplicative interactions where the value of the upper-layer neuron in the forward pass is computed as the product of two lower-layer neurons, which we call gate and signal neurons, where the gate is the neuron that is sigmoid activated. For this type of layer, we utilize the LRP "signal-take-all" redistribution rule (LRP-all) 13,58 , which redistributes all the relevance from the product to the signal neuron. For instance, in Eq. (2) last line, neurons in the hidden state h (t−1) will get assigned all the relevance from the product term z t ⊙ h (t−1) . • Activation layer. In the case of element-wise non-linear activation layer such as tanh and sigmoid, the relevance score from the upper-layer would be transferred to the lower-layer without any changes (identity redistribution). Note that this doesn't imply that the activation function is ignored by the LRP backward www.nature.com/scientificreports/ propagation process. In fact the activated neuron's value is taken into account through the LRP relevance redistribution in the next higher linear layer.
Further, as mentioned earlier, in order to initialize the LRP backward propagation process, the relevance score of the output neuron of interest, i.e. for the pollutant prediction we want to explain, is set to the neuron's predicted value, i.e. to the pollutant's concentration for the considered prediction time step, while the relevances of the other output neurons are set to zero. This predicted quantity will be redistributed layer-by-layer through LRP up until the input neurons, i.e. in our case onto the historical input features, as well as the forecast input features from previous time steps plus the current time step, according to the sequence-to-sequence model's architecture.
In order to check the correctness of our LRP implementation we performed a relevance conservation sanity check, where we verifed that the sum of the input neurons' relevances is equal to the model's predicted output for the pollutant and time step of interest. For this particular sanity check setup we redistributed the relevance assigned to the hidden layer biases uniformly onto lower-layer neurons (as per default in the standard LRP setup a share of relevance is absorbed by the hidden layer biases, and thus relevance is not exactly conserved).
Validating LRP through a toy task. In a previous work the authors designed a toy task to evaluate and compare XAI methods on recurrent networks 13 . Inspired by this approach, we design an arithmetic toy task to verify the LRP relevance scores objectively on our sequence-to-sequence model. As the relationship between inputs and outputs is known, we can use this task to validate the correctness of the LRP relevance results. These relevances should represent the contributions of the input features to the output accordingly to our data generation process.
To perform this experiment, we use a sequence-to-sequence model similar to the pollution forecasting one, but with fewer input features (we use only two input features) and a hidden layer dimension of 128. The data generation process is shown in Fig. 2. Output labels are created by performing simple operations such as sum and division on the sequence of random input features. 100,000 data points are generated with sequence lengths of five for historical inputs and three for forecast inputs, and split into train and validation sets in a ratio of four to one.
Same as with the pollution forecasting model, we have an encoder and a decoder module, each fed with a sequence of inputs, either historical or forecast inputs. According to the data generation process depicted in Fig. 2, to generate an output label y all the inputs from the historical input sequence are summed up and divided by two. The result is then passed to the decoder module. In each prediction time step for generating labels, the forecast input and the output from the previous time step are added together. If the model learned the input-output relationship correctly, which we ensured via model training, then LRP is supposed to compute the relevances in such a way that every element in the historical input contributes to the prediction with half of its value, while the contribution of each element in the forecast input, up to the current time step, should be equal to its value.
Given an input example, the true labels y and the predictions ŷ of the trained model are shown in the box below: www.nature.com/scientificreports/ On the example above, we perform an LRP decomposition on the last time step of the predicted output sequence, i.e. we explain the prediction ŷ 3 . The corresponding relevance scores of the inputs computed with LRP are shown in red, beside the value of the inputs in black, in Fig. 3.
From Fig. 3 we observe that the contributions of historical inputs are almost half of their values, and the relevance scores of forecast inputs are close to their value. Negative inputs get assigned negative contributions to the output due to their subtractive role in the summation. In other words, we can recover the input sequences from the LRP relevances by multiplying relevances of the historical inputs by two and keep the relevances of the forecast inputs unmodified. Thus LRP redistributes relevance scores among inputs as expected.

Explaining pollution forecasts
Our sequence-to-sequence model forecasts the concentration of four major pollutants, namely PM 10 , NO, NO 2 and O 3 for the next 48 h, as was described in the "Sequence-to-sequence model" section. Now using LRP as an explanation method, we analyse the contributions of the input features to high pollutant concentrations, and to this end we begin by performing some statistics on the LRP relevance scores.
For each type of pollutant, we filter 4800 data points with the highest predictions in the first day of the forecast (more precisely for each hour of the first 24 h of the forecast 200 points are selected). For these data points, we compute the LRP relevance scores by explaining the given predictions (for features that are one-hot encoded, we obtain the relevance of the feature by simply summing up the relevances across the input vector dimensions, which is equal to the relevance of the non-zero input vector entry). Then, we compute the average of the positive relevance scores for every hour of the first day of the forecast, and additionally, for a few specific input features' values we compute the average of the positive relevance scores across different hours of the first day of the forecast (in particular for the temporal input features hour of the day, day of the week, month of the year, working day vs. work-free day).
The resulting statistics over input features can be visualized as two-dimensional heatmaps (the more relevant the feature, the deeper the color), where the x-axis represents the prediction hours (i.e., the forecast time steps) and the y-axis represents the input features, as well as, for specific input features where the average was performed However, an important point to note here is that these heatmaps represent the contributions of the input features to the pollution forecast, but do not indicate the corresponding input features' values, i.e. their magnitude and range. For instance, we can not determine from the LRP heatmap whether the low temperature is positively contributing to high pollution, or whether it's high temperature. Therefore we need to average the inputs too in order to determine which input values are positively contributing to high pollution forecasts. Hence we compute the average of the input values on the subset of data points that was previously selected as discussed above, as well as the mean and standard deviation of these features over the whole dataset, and report them in Table 2. For more brevity only the statistics of the most relevant features per pollutant (as identified through the heatmaps, see subsequent sections) are retrieved in Table 2.
In the following we will interpret our LRP results regarding the most relevant input features per pollutant, as well as the distribution of relevance for specific input features' values for high pollution forecasts, and we contrast our findings with domain knowledge from prior works in order to assess the pertinence of an eXplainable AI method such as LRP in the environmental domain. Note that when interpreting the heatmaps (Figs. 4a, 5a, 6a, 7a) we always implicitly back up our findings upon Table 2 to identify which values (high or low) of the input features led to high predictions (by comparing their average on the subset of high prediction data points to their mean value over the whole dataset).
Explaining high PM 10 forecasts. In Fig. 4a we plot the average of the positive LRP relevance for high forecasts of the pollutant PM 10 . We find that the following input features are the most relevant ones for high levels of PM 10 concentration in the air: • High PM 10 measurement. We find that high values of PM 10 measurements in the historical input sequence have the strongest impact on the forecast of high PM 10 concentrations, especially in forecasts for the next hour, this effect then declines over time. PM 10 particles are usually produced from industrial and combustion processes, and through fugitive dust emission. Coarse particles of PM 10 are known to easily deposit, and typically travel less than ten km from their place of generation 63 . Accordingly PM 10 particles that are produced nearby and detected through sensor measurements can be mainly responsible for the amount of PM 10 concentration predicted in the next hours. We find that the locations of stations with high PM 10 concentrations are indeed close to industrial plants such as cement, steel or iron manufacturing plants, along with fossil fuel power plants. For example station DMD2, which is located in the city of Dortmund, has several industrial firms closely situated to it, like the Holcim HüttenZement cement fabricator, the DGW energy plant, and the Thyssenkrupp steel fabricator. Such industries are known to emit a significant amount of particulate matter into the air 64 . • Low temperature forecast. A low temperature forecast also contributes to a high PM 10 load in the air. The phenomenon of temperature inversion might explain this result. Indeed temperature inversion in cold days of the year is known to bring additional accumulation of pollutants in the atmosphere close to the ground 65 . Temperature inversion happens when the warm air floats above the cooler air near the surface. This warm layer of air can play the role of a lid and trap the pollution close to the ground. Besides, the high pollutant concentration during cold days can also be a consequence of additional emissions due to domestic heating, e.g. through fossil fuel combustion 66 .
In Fig. 4b, the contributions of different hours of the day to the high PM 10 concentrations are shown. Even though the hour input feature does not have a high relevance score overall for the prediction of PM 10 , compared to the historical PM 10 measurement and the forecasted temperature (as seen in heatmap Fig. 4a), we can still determine which hours of the day are more relevant than others, relatively. As it can be observed from the bar  Explaining high NO forecasts. Similar to the experiment with PM 10 , we plot the average of the positive LRP relevance for high forecasts of NO in Fig. 5a. We find that the following input features contribute most towards high NO concentration: • Low wind speed forecast. Wind usually plays a role in reducing the pollution load in an area by dispersing the pollutants. Conversely an absence of wind allows for pollutants to pile up. As a result, low wind speed becomes a contributor to an increase in pollution load. This could explain the high relevance scores we observe for low wind values with respect to high NO predictions, accordingly to Fig. 5a and Table 2. This result is also consistent with previous studies that found a strong negative correlation between NO concentration and wind speed [67][68][69] . • Low temperature forecast. As we observed with the pollutant PM 10 , we find that low temperature contributes to the prediction of high NO concentration. Again the reason could be the temperature inversion phenomenon or the significant increase in burning fossil fuels for heating and electricity generation during cold days 70,71 . • High humidity forecast. The heatmap in Fig. 5a also suggests a positive contribution of high humidity to high NO concentration. NO generated by natural or anthropogenic sources is a reactive gas and rapidly oxidizes to NO 2 , in particular through reaction with ozone. Due to this reaction large concentration of ozone and NO can not coexist 72 . Since high humidity is generally associated with lower solar radiation and less ozone formation, we hypothesise this could explain why fewer nitrogen monoxide is converted into nitrogen dioxide when the relative humidity is elevated, allowing NO to reach higher concentration. In the previous literature however we could not find such a relationship between high NO pollution load and high relative humidity. Thus our interpretation of this result shall be taken with caution, and there might be other mechanisms at play that better justify the high humidity contribution to the model's prediction. • Morning hours in the forecast. Finally Fig. 5a shows high relevance scores for the hour of the day feature in the forecast input sequence. Additionally, Fig. 5b represents the average contribution of each hour of the day to high NO predictions across various forecast time steps. The hours of the day that positively contribute the most to the NO load in the air are around 7 A.M.-8 A.M. in the morning. This is consistent with the diurnal variations of pollutants found in previous work 8 , presenting a major peak of NO concentration that starts early in the morning and lasts till late morning. Generally, high emissions of NO x coincide with rush-hour traffic occurring in the morning and late afternoon. However, in the late afternoon, the level of ozone concentration increases because of the presence of solar radiation during the day that promotes ozone generation. As a result, due to the oxidation of NO to NO 2 through the chemical reaction previously  73 . In Fig. 5b, beside the average contribution of each hour input feature to high forecasts (as a bar plot), we additionally plot the average occurrence of high NO forecasts for each prediction time step (as a dotted line). We observe that the distribution of high forecasts of NO approximately follows the same pattern as the contribution of the input feature hour of the day, but with a slight delay of around 2 h. This is probably due to the accumulation effect of pollution, meaning that NO produced at relevant input hours accumulates into the air to increase the NO prediction at subsequent hours.
Explaining high NO 2 forecasts. Figure 6a depicts the average positive LRP relevance for high forecasts of NO 2 . We find the following input features to contribute substantially towards high levels of NO 2 : • Low humidity forecast. Nitrogen dioxide NO 2 is an acidic gas and readily reacts with water vapor to generate nitric acid and nitrogen monoxide through the reaction of 3NO 2 + H 2 O −→ 2 HNO 3 + NO . This is also the reaction that causes acid rain formation from nitrogen dioxide present in the atmosphere. Therefore, when humidity is high NO 2 can be consumed and removed from the air through nitric acid deposition, and conversely, when the humidity is low this could contribute to NO 2 accumulation 74 . • Moderately high temperature forecast. As can be seen from Fig. 6a and Table 2, an average temperature of 15 °C (with standard deviation of 9.6 °C) seems to act in favor of a high NO 2 concentration. We suspect this temperature level to not be low enough to explain higher emissions due to fossil fuels combustion for heating or accumulation of pollutant through temperature inversion, and at the same time to not be high enough to justify a high ozone concentration which could lead to a secondary formation of NO 2 from oxidation of NO. Previous work found that indeed the NO 2 concentration is rather affected by temperature levels either lower than 5 °C or higher than 20 °C 75 . Therefore it is possible that other mechanisms are at play here to increase NO 2 formation, e.g. via nitrogen oxides emissions from micro-organisms present in the soil 76 . More investigations would be needed to understand the reasons behind the role of temperature in this context, as well as possible interactions with other factors not modelized in our approach. • Low wind speed forecast. Similarly to what we observed with the pollutant NO, the low wind speed also allows NO 2 to accumulate since the dispersion and transport of pollution is hindered, as opposed to when the wind speed is high.
According to the bar plot in Fig. 6b, the time of the day around 8 P.M. contributes the most to high NO 2 forecasts, and late afternoon hours contribute more than morning hours. This could be attributed to the contributions from evening rush-hour traffic, emissions due to heating, and the secondary formation of NO 2 from the reaction between primary emitted NO and O 3 77 . • High temperature forecast. Elevated temperature displays a positive contribution to the ozone pollution load. The reason is that higher temperature is representative of more solar radiation, which supports the photochemical formation of O 3 . Indeed solar energy is known to play a key role in ozone formation (see e.g. 9,10,78,79 ). • Low humidity forecast. As already mentioned, ground-level tropospheric ozone is formed as a result of complex photo-chemical reactions 9,10,79 . The corresponding cyclic reactions involve NO x , Volatile Organic Compounds (VOCs) and CO. High humidity is generally associated with cloudy sky and precipitation, resulting in weaker solar radiation, while on the other hand low humidity favors solar radiation which enhances ozone production 80 . This explains in great part the high contribution of a low humidity forecast on ozone concentration observed in Fig. 7a. Another complementary explanation for this result might be dry deposition, i.e. the uptake of ozone by trees 11 . Indeed when the relative humidity is high trees open their stomata for exchanging CO 2 , and then unintentionally also absorb O 3 from the air, while when the humidity is low trees close their stomata such that they don't dry out, and consequently they do not remove O 3 from the air, allowing it to pile up.
In Fig. 7b we visualize the average contribution of the input feature hour of the day for high O 3 forecasts. We observe that the most relevant hours are during daytime, and especially between noon and 5 P.M. This is consistent with the diurnal cycle of solar radiation and its impact on O 3 concentration, with a maximum concentration occuring around 3 P.M. (i.e. approx. 2 h after the maximum of solar radiation) as was found in previous work 78 . Further statistics on the relevance for other temporal input features like months of the year, days of the week and work-free versus working day for high pollution forecasts of the different pollutants can be found in the Supplementary Section 1.

Influence of wind direction.
In the previous subsections we analysed the relevance of various input variables on the forecast of high pollutant loads. In this subsection we investigate in more details a particular input feature which contributes to high forecasts for the pollutants NO and NO 2 : the wind direction. To determine the most important wind directions, the relevance of the different wind directions is computed for one specific station. In particular we consider the station BIEL in Bielefeld which has high forecasts of NO. We average the positive LRP relevance of the input feature wind direction for that specific station across different hours of the prediction time with high forecast values. This statistic is computed over 71 data points with the highest forecasts of NO. The result is shown in Fig. 8. According to the bar plot, wind that blows in the direction between 22.5°-45° enhances pollution the most, while the next peak can be found in the range 157.5°-180°. www.nature.com/scientificreports/ The location of the measuring station and the two most influential wind directions are depicted in a map in Fig. 9. As reported by the European Environment Agency 81 , energy production, as well as road and non-road transport generate a large share of NO emissions. In accordance with that we find two sites that could be responsible for the high emissions and which are consistent with the relevant wind directions identified via LRP: the Bielefeld airport and an energy production plant. This illustrates the usefulness of LRP for identifying potential relevant pollution sources.

Simplifying the model with LRP
According to the statistics on the positive, respectively the absolute, LRP relevance values computed over 19,200 data points with the highest forecasts for different pollutants in the first day of the prediction (4800 data points per pollutant) as shown in Fig. 10, we observe that the historical input data, except for the historical measurement of PM 10 , do not contribute significantly to the pollution prediction. This indicates the model is mainly relying on the nearby forecast input data to make its prediction. Considering this observation, we explored whether we could simplify the model's structure by removing the encoder module from the model and re-train a new model with only the decoder module without loosing much prediction performance. In order to keep feeding the model with the last value of the historical measurement of PM 10 , we included this value as input into the first time step of the decoder. The resulting validation loss and the training time for the modified and original models are given in Table 3. While the original encoder-decoder model receives a combination of historical and forecast inputs for training and prediction, and the decoderonly model just consists of the decoder fed with forecast data plus one measurement of PM 10 , we find that both models reach reasonable prediction performance. Indeed there is only a slight increase in the validation loss when using the decoder-only model (in terms of mean absolute prediction error). We provide some example predictions on a few data points from the validation set in Fig. 11. This experiment demonstrates that explaining the model behavior with LRP can also be used to design a more efficient model. By omitting the encoder module, we finally obtain a model that needs less time and computational resources for training and prediction. The resulting model looses a bit of accuracy, but still solves the prediction problem with enough precision to fulfil practical expectations.
Similarly, we inspected the impact of removing a few irrelevant features as identified by LRP from the historical input sequence of the original encoder-decoder model, and verified whether this affects the prediction performance negatively, as was done in previous work 41 . For this analysis we discarded the following historical input features: year, weekday, work-free day, sunshine and precipitation (i.e. rain), which never were assigned a high relevance for the high pollutant forecasts (see relevance heatmaps in the previous section and in Fig. 10). In terms of validation loss for each pollutant, when we re-trained a new model with this modification, the resulting change in prediction performance was no more than 0.02 μg/m 3 , which is negligible. Hence the LRP relevance can not only help to understand and explain the key factors influencing the pollution forecast the most, but can also be used to improve model efficiency, and reduce the number of input features.
Lastly, we inspected the influence of the static input feature station number (which is represented as an integer indicating the location of the considered station, and can in principle be used by the model to adapt its predictions to the environmental specificities of a given location). According to the LRP heatmaps, this input feature also never was assigned a large relevance value for high pollutant forecasts. Therefore we would expect it can www.nature.com/scientificreports/ be discarded from the input data without much loss in prediction performance. And indeed when we re-train the original encoder-decoder model without that feature, we obtain a new model with an equivalent prediction performance (no more than 0.02 μg/m 3 change in prediction performance per pollutant). This indicates that the remaining dynamic input features (such as weather forecast, previous pollution concentration and temporal features) contain enough information for the given pollution forecasting task at hand. For a comparison of the LRP results with other XAI methods, we provide in the Supplementary Section 2 additional heatmap visualizations with the XAI methods of Saliency and Gradient x Input 46,47 .

Conclusion
In this work we applied the LRP method on a trained sequence-to-sequence pollution forecasting model with GRU layers. The explanations produced by LRP helped us to better understand how the model makes its predictions. LRP heatmaps were qualitatively inspected, and their results were largely consistent with available domain knowledge. We expect LRP to be broadly useful for the fine-grained analysis of model behavior in the environmental domain, in particular for the identification of potential critical factors contributing to high pollution concentration, which could improve the air quality monitoring and mitigation. Further, we demonstrated that LRP can be used to design a computationally more efficient forecasting model. A possible future direction for investigation could be the identification of factors interacting together in the pollution forecast, e.g. through the successive removal of individual versus groups of variables, and their interpretation in terms of the involved chemical and physical processes. In this respect we believe that data driven and eXplainable AI could complement more traditional approaches for broadening the understanding of pollution mechanisms.

Data availability
The data is not publicly available. The data analysed in the study will be available from corresponding author on reasonable request. The code of the model is publicly available at https:// github. com/ Sara-mibo/ LRP_ Encod erDec oder_ GRU.